CLBlast: A Tuned OpenCL BLAS Library
Abstract
This work demonstrates how to accelerate dense linear algebra computations using CLBlast, an open-source OpenCL BLAS library providing optimized routines for a wide variety of devices. It is targeted at machine learning and HPC applications and thus provides a fast matrix-multiplication routine (GEMM) to accelerate the core of many applications (e.g. deep learning, iterative solvers, astrophysics, computational fluid dynamics, quantum chemistry). CLBlast has four main advantages over other BLAS libraries: 1) it is optimized for and tested on a large variety of OpenCL devices, including less commonly used devices such as embedded and low-power GPUs, 2) it can be explicitly tuned for specific problem sizes on specific hardware platforms, 3) it can perform operations in half-precision floating-point (FP16), saving precious bandwidth, time and energy, and 4) it can combine multiple operations in a single batched routine, accelerating smaller problems significantly. This paper describes the library and demonstrates the advantages of CLBlast experimentally for different use cases on a wide variety of OpenCL hardware.
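As a rough illustration of what the GEMM and batched routines mentioned in the abstract compute, the sketch below spells out the standard BLAS definition C = alpha * A * B + beta * C and applies it once per batch entry. It is plain Python for reference semantics only: the function names `gemm` and `gemm_batched` are hypothetical, and this is neither CLBlast's API nor its implementation.

```python
# Reference semantics only. The names gemm and gemm_batched are hypothetical
# and are NOT part of CLBlast's API; no OpenCL device is used here.

def gemm(alpha, a, b, beta, c):
    """Return alpha * (a @ b) + beta * c for matrices stored as lists of rows."""
    m, k, n = len(a), len(b), len(b[0])
    return [
        [alpha * sum(a[i][p] * b[p][j] for p in range(k)) + beta * c[i][j]
         for j in range(n)]
        for i in range(m)
    ]


def gemm_batched(alphas, a_batch, b_batch, betas, c_batch):
    """Apply one independent GEMM per batch entry and return the list of results."""
    return [
        gemm(alpha, a, b, beta, c)
        for alpha, a, b, beta, c in zip(alphas, a_batch, b_batch, betas, c_batch)
    ]


if __name__ == "__main__":
    a = [[1.0, 2.0], [3.0, 4.0]]
    identity = [[1.0, 0.0], [0.0, 1.0]]
    zeros = [[0.0, 0.0], [0.0, 0.0]]
    # Two small multiplications submitted together as one batch.
    results = gemm_batched([1.0, 2.0], [a, a], [identity, a], [0.0, 0.0], [zeros, zeros])
    for matrix in results:
        print(matrix)
```

Each batch entry is independent of the others, which is what allows a library to group many small multiplications into a single call instead of launching them one at a time.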
Similar resources
From CUDA to OpenCL: Towards a performance-portable solution for multi-platform GPU programming
In this work, we evaluate OpenCL as a programming tool for developing performance-portable applications for GPGPU. While the Khronos group developed OpenCL with programming portability in mind, performance is not necessarily portable. OpenCL has required performance-impacting initializations that do not exist in other languages such as CUDA. Understanding these implications allows us to provide ...
Accelerating Deep Neural Networks on Low Power Heterogeneous Architectures
Deep learning applications are able to recognise images and speech with great accuracy, and their use is now everywhere in our daily lives. However, developing deep learning architectures such as deep neural networks in embedded systems is a challenging task because of the demanding computational resources and power consumption. Hence, sophisticated algorithms and methods that exploit the hardw...
A portable and high-performance matrix operations library for CPUs, GPUs and beyond
High-performance computing systems today include a variety of compute devices such as multi-core CPUs, GPUs and many-core accelerators. OpenCL allows programming different types of compute devices using a single API and kernel language. However, there is no standard matrix operations library in OpenCL for operations such as matrix multiplication that works well on a variety of hardware from mul...
A compiler toolkit for array-based languages targeting CPU/GPU hybrid systems
This paper presents a compiler toolkit that addresses two important emerging challenges: (1) effectively compiling dynamic array-based languages such as MATLAB, Python and R; and (2) effectively utilizing a wide range of rapidly evolving hybrid CPU/GPU architectures. The toolkit provides: a high-level IR specifically designed to express a wide range of array-based computations and indexing modes...
Implementing Level-3 BLAS Routines in OpenCL on Different Processing Units
Kazuya Matsumoto, Naohito Nakasato, and Stanislav Sedukhin. The University of Aizu, Aizu-Wakamatsu, Fukushima 965-8580, Japan. Report date: 10/22/2014. Written language: English. First issue: 10 copies. Level-3 BLAS, GPU, multi-core CP...
Journal: CoRR
Volume: abs/1705.05249
Publication date: 2017